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PIECEWISE LINEAR REGULARIZED SOLUTION PATHS 
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IBM T. J. Watson Research Center and University of Michigan 

We consider the generic regularized optimization problem /3(A) = 
argming L(y, + AJ(/3). Efron, Hastie, Johnstone and Tibshirani 
[Ann. Statist. 32 (2004) 407-499] have shown that for the LASSO— 
that is, if L is squared error loss and J(/3) = ||/3||i is the l\ norm of 
P — the optimal coefflcient path is piecewise linear, that is, dj3{\)/d\ 
is piecewise constant. We derive a general characterization of the 
properties of (loss L, penalty J) pairs which give piecewise linear 
coefficient paths. Such pairs allow for efflcient generation of the full 
regularized coefficient paths. We investigate the nature of efficient 
path following algorithms which arise. We use our results to sug- 
gest robust versions of the LASSO for regression and classification, 
and to develop new, efficient algorithms for existing problems in the 
literature, including Mammen and van de Geer's locally adaptive re- 
gression splines. 

1. Introduction. Regularization is an essential component in modern 
data analysis, in particular when the number of predictors is large, pos- 
sibly larger than the number of observations, and nonregularized fitting is 
likely to give badly over-fitted and useless models. 

In this paper we consider the generic regularized optimization problem. 
The inputs we have are: 

• A training data sample X = (xi, . . . ,x„)^,y = (yi, . . . ,yn)^ ■, where Xj e W 
and yj € M for regression, yi G {±1} for two-class classification. 

• A convex nonnegative loss functional L : M" x M" M. 

• A convex nonnegative penalty functional J:MP — > M, with J(0) = 0. We 
will almost exclusively use J{(3) = \\(3\\q in this paper, that is, penalization 
of the iq norm of the coefficient vector. 
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We want to find 

(1) /3(A) = argminL(y,X/3) + AJ(/3), 

where A > is the regularization parameter; A = corresponds to no regu- 
larization, while hmA^oo/3(A) = 0. 

Many of the commonly used methods for data mining, machine learning 
and statistical modeling can be described as exact or approximate regu- 
larized optimization approaches. The obvious examples from the statistics 
literature are explicit regularized linear regression approaches, such as ridge 
regression [9] and the LASSO [17]. Both of these use squared error loss, but 
they differ in the penalty they impose on the coefficient vector /3: 

n 

(2) Ridge : /3(A) = minj^ly, - x^/?)^ + 

^ 1=1 

n 

(3) LASSO : /3(A) = min^(2/, - ^] I3f + A||/3||i. 

^ i=i 

Another example from the statistics literature is the penalized logistic re- 
gression model for classification, which is widely used in medical decision 
and credit scoring models: 

n 

m = minj] log(l + e-y^^'^P) + m\l 
^ i=i 

Many "modern" methods for machine learning and signal processing can 
also be cast in the framework of regularized optimization. For example, the 
regularized support vector machine [20] uses the hinge loss function and the 
^2-norm penalty: 

(4) /3(A) = minf](l - y,^J (3)+ + m\l 

' i=i 

where (•)+ is the positive part of the argument. Boosting [6] is a popular 
and highly successful method for iteratively building an additive model from 
a dictionary of "weak learners." In [15] we have shown that the AdaBoost 
algorithm approximately follows the path of the £i-regularized solutions to 
the exponential loss function e~^-^ as the regularizing parameter A decreases. 

In this paper, we concentrate our attention on (loss L, penalty J) pairings 
where the optimal path /3(A) is piecewise linear as a function of A, that is, 
3Ao = < Ai < • • • < Am = oo and 70, 71, ... , 7m-i ^ such that /3(A) = 
Pi'^k) + (A — Afc)7fc for Afc < A < A^+i. Such models are attractive because 
they allow us to generate the whole regularized path /3(A), < A < 00, simply 
by sequentially calculating the "step sizes" between each two consecutive A 
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values and the "directions" 71, . . . ,7m--i- Our discussion will concentrate on 
(L, J) pairs which allow efficient generation of the whole path and give 
statistically useful modeling tools. 

A canonical example is the LASSO (3). Recently [3] has shown that the 
piecewise linear coefficient paths property holds for the LASSO, and sug- 
gested the LAR-LASSO algorithm which takes advantage of it. Similar algo- 
rithms were suggested for the LASSO in [14] and for total- variation penalized 
squared error loss in [13]. We have extended some path-following ideas to 
versions of the regularized support vector machine [7, 21]. 

In this paper, we systematically investigate the usefulness of piecewise 
linear solution paths. We aim to combine efficient computational methods 
based on piecewise linear paths and statistical considerations in suggest- 
ing new algorithms for existing regularized problems and in defining new 
regularized problems. We tackle three main questions: 

1. What are the "families" of regularized problems that have the piece- 
wise linear property? The general answer to this question is that the loss 
L has to be a piecewise quadratic function and the penalty J has to be 
a piecewise linear function. We give some details and survey the resulting 
"piecewise linear toolbox" in Section 2. 

2. For what members of these families can we design efficient algo- 
rithms, either in the spirit of the LAR-LASSO algorithm or using different 
approaches? Our main focus in this paper is on direct extensions of LAR- 
LASSO to "almost-quadratic" loss functions (Section 3) and to nonpara- 
metric regression (Section 4). We briefly discuss some non-LAR type results 
for ii loss in Section 5. 

3. Out of the regularized problems we can thus solve efficiently, which 
ones are of statistical interest? This can be for two distinct reasons: 

(a) Regularized problems that are widely studied and used are obviously 
of interest, if we can offer new, efficient algorithms for solving them. In 
this paper we discuss in this context locally adaptive regression splines [13] 
(Section 4.1), quantile regression [11] and support vector machines (Section 
5). 

(b) Our efficient algorithms allow us to pose statistically motivated reg- 
ularized problems that have not been considered in the literature. In this 
context, we propose robust versions of the LASSO for regression and classi- 
fication (Section 3). 

2. The piecewise linear toolbox. For the coefficient paths to be piece- 
wise linear, we require that ^gjf'V ll II be a piecewise constant vector as 
a function of A. Using Taylor expansions of the normal equations for the min- 
imizing problem (1), we can show that if L,J are both twice differentiable 
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in the neighborhood of a solution /3(A), then 

(5) ^ = -[V'L0{X)) + AVV(^(A))]-iVJ(/3(A)), 

where we are using the notation L(/3(A)) in the obvious way, that is, we 
make the dependence on the data X,y (here assumed constant) imphcit. 

Proposition 1. A sufficient and necessary condition for the solution 
path to be linear at Aq when L, J are twice dijferentiable in a neighborhood 
of /3(Ao) is that 

(6) - [V^LiPiX)) + AV2 J(/3(A))]-iV J(/3(A)) 

is a proportional (i.e., constant up to multiplication by a scalar) vector in 
MP as a function of X in a neighborhood of Aq • 

Proposition 1 imphes sufficient conditions for piecewise hnearity: 

• -L is piecewise quadratic as a function of P along the optimal path /3(A), 
when X,y are assumed constant at their sample values; and 

• J is piecewise linear as a function of /3 along this path. 

We devote the rest of this paper to examining some families of regularized 
problems which comply with these conditions. 

On the loss side, this leads us to consider functions L which are: 

• Pure quadratic loss functions, like those of linear regression. 

• A mixture of quadratic and linear pieces, like Huber's loss [10]. These 
loss functions are of interest because they generate robust modeling tools. 
They will be the focus of Section 3. 

• Loss functions which are piecewise linear. These include several widely 
used loss functions, like the hinge loss of the support vector machine (4) 
and the check loss of quantile regression [11] 



where 

(7) l{y^,P'^^^ 



L(y,X/3)=^/(y,,/3Txi), 



T ■ (yj - /3^Xi), if yj - /3'^Xi > 0, 

{1 - t) ■ {(3~^ Xi - yi) , otherwise, 



and T G (0, 1) indicates the quantile of interest. 

On the penalty side, our results lead us to consider the £i and £oo penalties 
as building blocks for piecewise linear solution paths. For lack of space, we 
limit the discussion in this paper to the ii penalty and its variants (like total 
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variation penalties discussed in Section 4). Results on the ioo penalty can 
be found in a full technical report on the second author's homepage. 

ii regularization has several favorable statistical properties. Using ii reg- 
ularization results in "sparse" solutions with a relatively small fraction of 
nonzero coefficients, as opposed to £2 regularization which forces all nonzero 
coefficients [17]. In particular, if the number of predictors is larger than the 
number of observations {p> n), then for any A there exists an £i-regularized 
solution with at most n nonzero coefficients [15]. Thus, in situations where 
the number of relevant variables is small and there are a lot of irrelevant 
"noise" variables, ii regularization may prove far superior to £2 regulariza- 
tion from a prediction error perspective. From an inference/interpretation 
perspective, ii regularization gives "smooth" variable selection and more 
compact models than I2 regularization. In the case of orthogonal wavelet 
bases, the soft thresholding method proposed by [2], which is equivalent to 
li regularization, is asymptotically nearly optimal (in a minimax sense) over 
a wide variety of loss functions and estimated functions. 

It is not surprising, therefore, that ii regularization and its variants have 
been widely and successfully used in different fields, including engineering 
and signal processing (such as basis pursuit and wavelet thresholding), ma- 
chine learning (such as boosting and ii SVM) and, obviously, statistics, 
where ii and total variation penalties are prevalent. 

3. Almost quadratic loss functions with penalty. In this section, we 
first define a family of "almost quadratic" loss functions whose £i-penalized 
versions generate piecewise linear solution paths. We formulate and prove an 
algorithm, which is an extension of the LAR-LASSO algorithm, that gener- 
ates the £i-penalized solution paths for all members of this family. We then 
concentrate on two members of this family — Huberized LASSO for regres- 
sion and £i-penalized Huberized squared hinge loss for classification — which 
define new, robust, efficient and adaptable modeling tools. An R imple- 
mentation of these tools is available from the second author's homepage, 
www . st at . Isa. umich .edu/~jizhu/code /piecewise / . 

3.1. Main results. We fix the penalty to be the £1 penalty, 



j 

and the loss is required to be differentiable and piecewise quadratic in a 
fixed function of the sample response and the "prediction" P'^x, 



(9) L(y,X/3)=5^/(y„/5^x,), l{y,f3'^^)=a{ry + b{r)r + c{r), 



where r = {y — /3^x) is the residual for regression and r = {y/3~^ji.) is the 
margin for classification; and l{r) is a quadratic spline, that is, a(-), 6(-), c(-) 



(8) 
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are piecewise constant functions, defined so as to make the function I differ- 
entiable. 

Some examples from this family are: 

• The squared error l{y, /3'''x) = (y — /?'''x)^, that is, a = 1, 6 = 0, c = 0. 

• Huber's loss function with fixed knot t, 



(10) liy,P^^) 



(y-/3^x)2, if\y-(3^^\<t, 



2t\y — /3~''x| — t^, otherwise. 
• Squared hinge loss for classification, 
(11) Ky,/5^x) = (l-y/3Tx)^. 

Note that the hinge loss of the support vector machine and the check loss of 
quantile regression do not belong to this family as they are not differentiable 
at yfi^'x. = 1 and y — f5'^x. = 0, respectively. 

Theorem 2. All regularized problems of the form (1) using (8), (9) 
(with r being either the residual or the margin) generate piecewise linear 
optimal coefficient paths /3(A) as the regularization parameter A varies. 

Proof. We prove the theorem formally using the Karush-Kuhn-Tucker 
formulation of the optimization problem. 

We rewrite the regularized optimization problem as 

min ^Z(y„(/3+-/3-)Tx,) + A^(/3+ + /3-) 

subject to /3+ > 0, (3^" > Vj. 
The Lagrange primal function is 

Y^lim, (/3+ - /3-)^x,) + A^(/3+ + /37) - E ^tf^t - E ^Jf^i- 

i j j j 

The derivatives of the primal and the corresponding KKT conditions imply 
(VL(/3)),- + A - A+ = 0, -(VL(/3)), + A - A" = 0, 

A+/3+ = 0, A-/3-=0. 

Using these we can figure that at the optimal solution for fixed A the fol- 
lowing scenarios should hold: 

A = =^ (VL(/3))j = Vj (unconstrained solution), 

/?+ > 0, A > ^ A+ = ^ (VL(/3))j = -A < ^ AT > ^ /3r = 0, 

> 0, X > ^ I3j~ = (by similar reasoning), 
|(VL(/3))j| > A =^ contradiction. 
Based on these possible scenarios we can see that: 
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• Variables can have nonzero coefficients only if their "generalized absolute 
correlation" |VL(/3(A))j| is equal to A. Thus, for every value of A we have 
a set of "active" variables A= {j : f3j{\) ^ 0} such that 

(12) jG^^|VL(/3(A))j| = A, sgn(VL(/3(A))j) = -sgn(/3(A)j), 

(13) j(^A^\VLCm)j\<\. 

• When A changes, the direction in which /3(A) is moving, that is, 
should be such that it maintains the conditions (12), (13). 

So, if we know what the "active" set A is, it is a simple task to check that 
as long as we are in a region where the loss is twice differentiable and the 
penalty is right differentiable, we will have 

(14) = -{V'LimUr' sgn(/3(A)^), 

which is just a version of (5), limited to only the active variables and sub- 
stituting the £i penalty for J. 

For the family of almost quadratic loss functions, we can derive V^L(/3(A))^ 
explicitly, 

i 

Since a(-) is a piecewise constant function, then V^L(/3(A))_4 and 0/3(A)_4/9A 
are also piecewise constant; therefore, the solution path /3(A) is piecewise 
linear. 

When one of the following "events" occurs, twice differentiability is vio- 
lated and hence the direction in (14) will change: 

• Add a variable: A new variable should join A; that is, we reach a point 
where | VL(/3(A))_4C | < A will cease to hold if /3(A) keeps moving in the 
same direction. 

• Drop a variable: A coefficient in A hits 0. In that case, we reach a non- 
differentiability point in the penalty and we can see that sgn(VI/(/3(A))^) 
= — sgn(/?_4) will cease to hold if we continue in the same direction. Thus 
we need to drop the coefficient hitting from A. 

• Cross a knot: A "generalized residual" r{yi, (3{X)'^ji.i) hits a non-twice 
differentiability point (a "knot") in L, for example the "Huberizing" point 
t in (10), or the hinge point 1 in (11). 

So we conclude that the path /?(A) will be piecewise linear, with the direction 
given by (14) and direction changes occurring whenever one of the three 
events above happens. When it happens, we need to update A or a(r) to 
get a feasible scenario and recalculate the direction using (14). □ 
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Based on the arguments in the proof we can derive a generic algorithm to 
generate coefficient paths for all members of the "almost quadratic" family of 
loss functions with ii penalty. The LAR-LASSO algorithm [3] is a simplified 
version of this algorithm since "knot crossing" events do not occur in the 
LASSO (as the loss is twice differentiable) . Our algorithm starts at A = cxd 
and follows the linear pieces, while identifying the "events" and recalculating 
the direction when they occur. 

Algorithm 1 . An algorithm for "almost quadratic" loss with ii penalty. 

1. Initialize: 

P = 0, ^ = argmax|VL(/3)|j, 7^ = - sgn(VL(/3))^, 7.4^=0. 
j 

2. While (max|VL(/3)| >0): 

(a) di = min{(i>0:|VL(/3 + (i7)j| = |VL(/? + (i7)^|, j^A}, 
d2 = mm{d > : (/3 + ^7)^ = 0, j G A} (hit 0), 

ds = minjd > : r(?/j, (/? + (i7)^Xj) hits a "knot," i = 1, . . . , n}. 
Find step length: d = min((ii, 1^2, da). 

(b) Take step: /3 <— /3 + d'y. 

(c) If d = di then add variable attaining equality at d to A. 
Ii d = d2 then remove variable attaining at d from A. 

If d = ds for i*, then assign new a(r(yj«,/3'''xj*)) from (9). 

(d) Calculate new direction: 

i 

lA = C"^ ■ sgn(/3^) and 7^40 = 0. 

It should be noted that our formulation here of the "almost quadratic" 
family with ii penalty has ignored the existence of a nonpenalized intercept. 
This has been done for simplicity of exposition, however incorporating a 
nonpenalized intercept into the algorithm is straightforward. 

3.2. Computational considerations. What is the computational complex- 
ity of running Algorithm 1 on a dataset with n observations and p variables? 
The major computational cost for each step involves figuring out the step 
length in (2a), and updating the new direction in (2d). The former takes 
0{np) calculations, and the latter requires O(j^p) computations by using 
inverse updating and downdating. 

It is difficult to predict the number of steps on the solution path for 
arbitrary data. According to our experience, the total number of steps taken 
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by the algorithm is on average 0{n). This can be heuristicahy understood 
as follows. If n > p, it takes 0{p) steps to add all variables and 0{n) steps 
for knot crossing; if n < p, since at most n variables are allowed in the fitted 
model, it takes 0(n) steps for both adding variables and crossing knots; the 
"drop events" are usually rare, 0(1). Since the maximum value of \A\ is 
min(n,p), it suggests the overall computational cost is 0{ii?p). 

3.3. The Huberized LASSO for regression. We now concentrate on two 
members of the "almost quadratic" family of loss functions — one for regres- 
sion and one for classification. 

We first consider the Huberized LASSO for regression. The loss is given 
by (10). It is robust in the sense defined in [10], in that it protects against 
"contamination" of the assumed normal errors. It is "almost quadratic" as 
defined in Section 3.1, and so Theorem 2 and Algorithm 1 apply to its £i 
regularized solution paths. 

Prostate cancer dataset. We use the "prostate cancer" dataset [17] to 
compare the prediction performance of the Huberized LASSO to that of the 
LASSO on the original data and after we artificially "contaminate" the data 
by adding large constants to a small number of responses. 

We used the training-test split as in [8]. The training set consists of 67 
observations and the test set of 30 observations. We ran the LASSO and 
the Huberized LASSO with a knot at t = 1 on the original dataset, and 
on the "contaminated" dataset where 5 has been added/subtracted to the 
responses of 12 observations. 

Figure 1 shows the mean squared error on the 30 test set observations for 
the four resulting regularized solution paths from solving the LASSO and 
Huberized LASSO for all possible values of A on the two datasets. We ob- 
serve that on the noncontaminated data, the LASSO (solid) and Huberized 
LASSO (dashed) perform quite similarly. When we add contamination, the 
Huberized LASSO (dash-dotted) does not seem to suffer from it at all, in 
that its best test set performance is comparable to that of both regular- 
ized models on the noncontaminated data. The prediction performance of 
the standard LASSO (dotted), on the other hand, deteriorates significantly 
(t-test p-value 0.045) when contamination is added, illustrating the lack of 
robustness of squared error loss. 

The two LASSO solutions contain nine linear pieces each, while the Huber- 
LASSO path for the noncontaminated data contains 41 pieces, and the one 
for the contaminated data contains 39 pieces; both agree with our conjecture 
in Section 3.2 that the number of steps is 0(n). Figure 2 shows the solution 
paths for the contaminated LASSO model and the contaminated Huber- 
LASSO model. We observe that the two paths are quite different and the 
two best models (corresponding to the solid vertical lines) are also different. 
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Fig. 1. Test MSE of the models along the regularized paths. See text for details. 




1 2 3 4 0.0 0.5 1,0 1,5 2.0 2,5 3.0 

wmh wmh 

Fig. 2. Solution paths of the LASSO [left) and the Huberized LASSO {right) on the 
contaminated prostate cancer training data. The vertical grey lines correspond to the steps 
along the solution paths. The vertical solid lines correspond to the models that give the best 
performances on the test data. 
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3.4. The Huberized squared hinge loss for classification. For classification 
we would like to have a loss which is a function of the margin, r(?/,/?^x) = 
(y/3'''x). This is true of all loss functions typically used for classification, 
like the negative binomial log-likelihood for logistic regression, the hinge 
loss for the support vector machine and exponential loss for boosting. The 
properties we would like from our classification loss are: 

• We would like it to be "almost quadratic," so we can apply the Algorithm 
1 in Section 3.1. 

• We would like it to be robust, that is, linear for large absolute value 
negative margins (like the logistic or hinge), so that outliers will have a 
small effect on the fit. 

This leads us to suggest for classification the "Huberized squared hinge loss," 
that is, (11) "Huberized" at t < 1, 

r (1 - tf + 2(1 -t){t- 2//?Tx), if y/?"^x < t, 
(15) /(2/,/3^x) = <^ (1 - 2//?Tx)2, if t < y/J^x < 1, 

L 0, otherwise. 
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It is a simple task to show that 

argminEyZ(y, /) = 2Pr(y = 1) — 1. 

/ 

Hence the population minimizer of the Huberized squared hinge loss gives 
the correct sign for classification. 

To illustrate the robustness of this loss (15) and its computational su- 
periority over the logistic loss, we considered the following simple example: 
X e with class centers at (—1, —1) (class and (1, 1) (class "1") with 

one big outlier at (30,100) belonging to the class "—1." The Bayes model, 
ignoring the outlier, is to classify to class "1" if and only if xi + X2> 0. 

Figure 3 shows the regularized solution paths and misclassification rate 
for this example using the logistic loss (left), the Huberized squared hinge 
loss (middle) and the squared hinge loss (right), all with £i penalty. We 
observe that the logistic and Huberized regularized model paths are both 
less affected by the outlier than the non-Huberized squared loss. However, 
logistic loss does not allow for efficient calculation of the ii regularized path. 

4. Nonparametric regression, total variation penalties and piecewise lin- 
earity. Total variation penalties and closely associated spline methods for 
nonparametric regression have experienced a surge of interest in the statistics 
literature in recent years. The total variation of a univariate differentiable 
function /(x) is 

/oo 
\f'ix)\dx. 

If / is nondifferentiable on a countable set xi,X2,---, then TV{f) is the 
sum of TVdifif), calculated over the differentiable set only and the absolute 
"jumps" in / where it is noncontinuous. In what follows we assume the range 
of / is limited to [0, 1]. 

Total variation penalties tend to lead to regularized solutions which are 
polynomial splines. [13] investigates the solutions to total- variation penalized 
least squares problems. The authors use total variation of {k — l)st order 
derivatives, 

n 

(16) ^(y,-/(x,))VA-TF(/(^^-i)). 

i=l 

They show that there always exists a solution fi^ x such that fj!'^ is piece- 
wise constant, that is, fk^x is a polynomial spline of order k. For k G {1,2} 
the knots of the spline solutions are guaranteed to be at the data points. 

A similar setup is considered in [1]. Their taut-string and local squeezing 
methods lead to solutions that are polynomial splines of degree or 1, with 
knots at data points. 
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Now, consider a polynomial spline / of order k, with h knots located at 
0<ti<--- <th<l, that is, 

h 

(17) /(;^) = ^/5.(^_t.)^-i + g(^), 

i=i 

where q{x) is a polynomial of degree k — 1. The total variation of the {k — l)st 
derivative of / clearly corresponds to an ii norm of the set of coefficients of 
the appropriate spline basis functions, anchored at the knots, 

(18) Tv{f^'^-''^) = {k-iy..j2m- 

If the knots ti, . . . ,th are fixed in advance (e.g., at the data points), then a 
total variation penalized problem is equivalent to an ^i-penalized regression 
problem, with p = h derived predictors. If we also employ squared error loss, 
we get a LASSO problem, and we can use the LAR-LASSO algorithm to 
compute the complete regularized solution path. The only difference from 
the standard LASSO is the existence of k nonpenalized coefficients for the 
polynomial q{x), instead of the intercept only for the LASSO. This requires 
only a slight modification to the LAR-LASSO algorithm. This leads to es- 
sentially the same algorithm as Algorithm 2 of [13] for finding the regularized 
path for any k with a fixed, predetermined set of candidate knots. 

4.1. Locally adaptive regression splines. We now concentrate on the fam- 
ily of penalized problems (16) defined by Mammen and van de Geer [13]. 
As we have mentioned, [13] develops an exact method for finding fi^^x when 
k G {1,2} and approximate methods for k> 2 (where the knots of the op- 
timal solutions are not guaranteed to be at the data points). We now show 
how we can use our approach to find the spline solution fk^\ exactly for any 
natural k. The resulting algorithms get practically more complicated as k 
increases, but their theoretical computational complexity remains fixed. 

When the knots are not guaranteed to be at the data points, we can still 
write the total variation of polynomial splines as the sum of ii norms of 
coefficients of basis functions, as in (18). However, we do not have a finite 
predefined set of candidate basis functions. Rather, we are dealing with an 
infinite set of candidate basis functions of the form 

X = {{x-t)''+-^:0<t<l}. 

Our algorithm for tracking the regularized solution path in this case 
proceeds as follows. We start at the solution for A = oo, which is the least 
squares {k — l)st degree polynomial fit to the data. Given a solution fk^XQ 
for some value of Aq , which includes nx^ knots at ti , . . . , tn^^ , denote 

z(x) = {l,x,x ,...,x ,{x — ti)_^ , . . . , (x — )_|_ ) , 
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which is the current predictor vector. Following (17) we can write 

Following the logic of the LAR-LASSO algorithm, we see that the solution 
will change as 

fk,\o-d = (/3(Ao) + dj)~^z{x), 

where 7 = -(/c- 1)! • {Z^ Z)'^ ■ s, Z = (z(xi), . . . , z(x„))^ and s € M^'+"^o 
is a vector with components corresponding to l,x,. . . and ±1 com- 

ponents corresponding to each (x — tj)^^ [with the sign being the opposite 
of the sign of {x — tj)^^^ {y — fk,Xo)]- What we now need to identify is the 
value of A at which an additional knot needs to be added, and the location 
of that knot. Consider first a fixed knot candidate t. Then we can see that 
the LAR-LASSO criterion for adding this knot to the set of "active" knots 
is 

|x7(y-Z/3(Ao)-(Ao-A)Z7)|=A, 
where = (x — t)^~^ (column vector of length n). More explicitly, define 

(19) .^(^^ x^iy-^/^w-Aoz^) , 

1 - Xj' Z7 

(20) ^_(^,_x^y-Z«A„)-A„ZT) 



-i-x;z7 



Then we can write 

(21) A(t) : 



max(A+(t), A_(t)), if max(A+(i), A_(t)) < Aq, 
min(A+(t), A_(t)), if max(A+(i), A_(t)) > Aq. 

Now we see that we can in fact let t be a parameter and find the next 
knot to be added to the optimal solution path by maximizing A(t), that is, 

(22) Aadd= , ^max A(t), 

tG(0,l)\{ti,...,t„^J 

which is the value of A where we stop moving in direction 7, add a knot at 
the argument of the maximum and recalculate the direction 7. 

Solving (22) requires finding the local extrema of the functions in (21), 
which are rational functions within each interval between two points which 
are either data points or knots (with numerator and denominator both of 
degree A: — 1). Thus, a reasonable tactic is to find the extrema within each 
such interval, then compare them between the intervals to find the overall 
solution to (22). For smaller values of k it can be solved manually and 
exactly: 
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• For k € {1,2}, we get a ratio of constant or linear functions in (21), and 
therefore the extrema — and the knots — are guaranteed to be at the data 
points, leading to the algorithm of [13]. 

• For k = 3 we get a ratio of quadratics in (21), and we can find the extrema 
within each segment analytically. These extrema may not correspond to 
the segment's end points, and so we may have knots that are not at data 
points. 

Assuming we have the code to solve the maximization problem in (22), 
Algorithm 2 gives a general schema for following the solution path fk^\ for 
any value of k. 

Algorithm 2. Tracking the path of TV-penalized solutions. 

1. Initialize: 

f{x) = (l,x, . . . , x''~^)'^ Pis is the LS polynomial fit of degree k — 1, 

u = argmax|(x — t)^^~^ {y — /(x))| (assumed unique), 
te(o,i) 

T = M, Ao = (A;-l)!.|(x-n)t-iT(y_/(x))|, 
Z = (l,x, . . . ,x'=-\ (x - u)X-'), /3(Ao) = (AL 0)^, 
s = (0,^, - sgn{(x - u%-^^{y - /(x))})^. 

2. While Ei(yi-/(^0)'>0: 

(a) Set 7 = -(A;-l)!(ZTZ)-is. 

(b) Vt e (0, 1) \ r define A+(i), A_(t), A(i) as in (19)-(21). 

(c) Solve the maximum problem in (22) to get Aadd- 

(d) Let Arem = Aq — min{d > : 3j > A; s.t. Pj{Xo) + d'fj = 0}. 

(e) If Aadd > Arcm add a knot at the point attaining the maximum 
in (22), and update T, Z and s. 

(f) Similarly, if Aadd < -^rcm remove the knot attaining at Arem- 

(g) In both cases, update: 

/3(Ao) ^ /3(Ao) + (Ao - max(Aadd, Arem))7> 
Ao ^max(Aadd,A rem J ■ 

Since we can never have more than n — k knots in a solution fk^\ [15], 
the computational complexity of each iteration of the algorithm is bounded 
at O(n^) calculations for finding the next knot and O(n^) for calculating 
the next direction (using updating formulas). The number of steps of the 
algorithm is difficult to bound, but from our experience seems to behave like 
0{n) (which is the conjecture of [13]). 
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Fig. 4. Applying Algorithm 2 [with A; = 3) to a data example where the underlying curve 
is a quadratic spline with knots at 0.25, 0.5 and 0.75. See text for details. 

4.2. Simple data example: k = 3. We illustrate our algorithm on a simple 
data example. We select 100 x samples uniformly on (0,1). We draw the 
corresponding y values as N{g{x), 0.03'^), where g{x) is a polynomial spline 
with knots at 0.25,0.5 and 0.75, 

g{x) = 0.125 + 0.125X - + 2{x - 0.25)1 - 2{x - 0.5)1 + '^(^ " 0.75)^. 

g{x) is plotted as the solid line in Figure 4, and the noisy y values as circles. 
The signal-to-noise ratio is about 1.4. 

We apply our Algorithm 2 with k = 3. Figure 4 shows the resulting models 
after 5, 15 and 50 iterations of the algorithm. After 5 iterations, the regu- 
larized spline contains three knots like the true g, but these are all around 
0.5. The fitted model, drawn as a dashed curve, is clearly underfitted. The 
corresponding reducible squared prediction error is 9.5 x 10~^. 
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After 15 iterations, the spline contains four knots, at 0.225, 0.255, 0.485 
and 0.755. The first one has a small coefficient, and the other three closely 
correspond to the knots in g. The resulting fit (dotted curve) is a reasonable 
approximation of g, and the reducible squared error is about 3.1 x 10~^. 

After 50 iterations the model contains ten knots and the data is clearly 
overfitted (dash-dotted curve, reducible squared error 8.2 x 10~^). 

Although the algorithm should in principle continue until it interpolates 
the data, in practice it terminates before (in this case after about 180 itera- 
tions) and is numerically unable to further improve the fit. This is analogous 
to the situation described in [7] for kernel SVM, where the effective rank of 
the kernel matrix is significantly smaller than n, since many eigenvalues are 
effectively zero. 

5. Using £i loss and its variants. Piecewise linear nondifferentiable loss 
functions appear in practice in both regression and classification problems. 
For regression, absolute value loss variants like the quantile regression loss 
are quite popular [11]. For classification, the hinge loss is of great importance, 
as it is the loss underlying the support vector machine [20] . Here we consider 
a generalized formulation, which covers both of (7) and (4). The loss function 
has the form 



with the generalized "residual" being r = {y — /3 x) for regression and r = 
{y ■ P'^:x.) for classification. 

When these loss functions are combined with penalty (or total variation 
penalty, in appropriate function classes [12]), the resulting regularized prob- 
lems can be formulated as linear programming problems. When the path 
of regularized solutions /3(A) is considered, it turns out to have interesting 
structure with regard to A: 

Proposition 3. For loss functions of the form (23), there exists a set 
of values of the regularization parameter < Ai < • • • < Am = oo such that: 

• The solution (3{\k) is not uniquely defined, and the set of optimal solutions 
for each A^ is a straight line in MP. 

• For any A € (Afc,Afc+i), the solution /3(A) is fixed and equal to the min- 
imum ii norm solution for Afc and the maximum ii norm solution for 



(23) 




iia + r>0 
if a + r < 0, 



Proposition 3 generalizes observations on the path of solutions made in the 
context of quantile regression in [12] and in the context of 1-norm support 
vector machines in [21]. Note that this leads to describing a regularized path 
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which is piecewise constant as a function of the regularization parameter A, 
with jumps at the values Ai,...,Am- However, it is stiU piecewise linear 
in the ii norm of the solution, ||/3(A)||i. The algorithm for computing the 
solution path follows the spirit of our earlier work [21]. For brevity we omit 
the details. We note, however, that it is fundamentally different from the 
LARS-LASSO algorithm and Algorithm 1, because we are now dealing with 
a nondifferentiable loss function. 

An interesting variant of piecewise linear loss is to replace the ii loss 
with an £qq loss, which is also piecewise linear and nondifferentiable. It leads 
to interesting "mini-max" estimation procedures, popular in many areas, 
including engineering and control. For example, [19] proposes the use of ii- 
penalized £oo-loss solutions in an image reconstruction problem (but does 
not consider the solution path). Path- following algorithms can be designed 
in the same spirit as the ii loss case. 

6. Conclusion. In this paper we combine computational and statistical 
considerations in designing regularized modeling tools. We emphasize the 
importance of both appropriate regularization and robust loss functions for 
successful practical modeling of data. From a statistical perspective, we can 
consider robustness and regularization as almost independent desirable prop- 
erties dealing with different issues in predictive modeling: 

• Robustness mainly protects us against wrong assumptions about our error 
model. It does little or nothing to protect us against the uncertainty about 
our model structure which is inherent in the finiteness of our data. For 
example, if our errors really are normal, then squared error loss minimizes 
the asymptotic variance of the coefficients, no matter how little data we 
have or how inappropriate our model is [10]. Using robust loss in such a 
situation is always counter-productive. 

• Regularization deals mainly with the uncertainty about our predictive 
model structure by limiting the model space. Note, in this context, the 
equivalence between the "penalized" formulation (1) and a "constrained" 
formulation min^ L(y, XP) subject to J{P) < s. The two formulations share 
the same solution path. The constrained formulation exposes the goal of 
regularization as "simplifying" the model estimation problem by limiting 
the set of considered models. 

There are many interesting directions in which our work can be extended. 
We may ask, how can our geometric understanding of the regularized solu- 
tion paths help us to analyze the statistical properties of the models along 
the path? For example, [3] has offered analysis of the LASSO path. This 
becomes much more challenging once we stray away from squared error loss. 
We may also consider more complex penalty structure, such as local or data- 
dependent penalties [1] or multiple penalties [18]. 
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Finally, it is worth noting that limiting our discussion to convex problems, 
for which efficient algorithms can be designed, leaves out some other statisti- 
cally well motivated fitting approaches. The use of a nonconvex penalty was 
advocated by Fan and collaborators in several papers [4, 5]. They expose 
the favorable variable selection property of the penalty function they offer, 
which can be viewed as an improvement over the use of ii penalty. [16] ad- 
vocates the use of nonconvex -^-loss in the classification setting, minimizing 
the effect of outliers and misclassified points. This approach can be viewed 
as an even more robust version of our Huberized loss function, with strong 
statistical motivation in terms of asymptotic behavior. 

Acknowledgments. We thank the referee. Associate Editor and Co-Editors 
J. Fan and J. Harden for their thoughtful and useful comments, and in par- 
ticular for introducing us to the relevant literature on total variation penal- 
ties. We thank B. Efron, J. Friedman, T. Hastie, R. Tibshirani, B. Yu and 
T. Zhang for their helpful comments and suggestions. 

REFERENCES 

[1] Davies, p. L. and KovAC, A. (2001). Local extremes, runs, strings and multireso- 
lution (with discussion). Ann. Statist. 29 1-65. MR1833958 

[2] DoNOHO, D., Johnstone, I., Kerkyacharian, G. and Picard, D. (1995). Wavelet 
shrinkage: Asymptopia? (with discussion). J. Roy. Statist. Soc. Ser. B 57 301- 
369. MR1323344 

[3] Efron, B., Hastie, T., Johnstone, I. M. and Tibshirani, R. (2004). Least angle 
regression (with discussion). Ann. Statist. 32 407-499. MR2060166 

[4] Fan, J. and Ll, R. (2001). Variable selection via nonconcave penalized likelihood and 
its oracle properties. J. Amer. Statist. Assoc. 96 1348-1360. MR1946581 

[5] Fan, J. and Peng, H. (2004). Nonconcave penalized likelihood with a diverging 
number of parameters. Ann. Statist. 32 928-961. MR2065194 

[6] Freund, Y. and Schapire, R. E. (1996). Experiments with a new boosting algo- 
rithm. In Proc. 13th International Conference on Machine Learning 148-156. 
Morgan Kauffman, San Francisco. 

[7] Hastie, T., Rosset, S., Tibshirani, R. and Zhu, J. (2004). The entire regulariza- 
tion path for the support vector machine. J. Mach. Learn. Res. 5 1391-1415. 

[8] Hastie, T., Tibshirani, R. and Friedman, J. (2001). The Elements of Statis- 
tical Learning: Data Mining, Inference and Prediction. Springer, New York. 
MR1851606 

[9] HOERL, A. and Kennard, R. (1970). Ridge regression: Biased estimation for 

nonorthogonal problems. Technometrics 12 55-67. 
[10] HuBER, P. (1964). Robust estimation of a location parameter. Ann. Math. Statist. 

35 73-101. MR0161415 
[11] KOENKER, R. (2005). Quantile Regression. Cambridge Univ. Press. MR2268657 
[12] KOENKER, R., Ng, P. and Portnoy, S. (1994). Quantile smoothing splines. 

Biometnka 81 673-680. MR1326417 
[13] Mammen, E. and van de Geer, S. (1997). Locally adaptive regression splines. Ann. 

Statist. 25 387-413. MR1429931 



20 



S. ROSSET AND J. ZHU 



[14] Osborne, M., Presnell, B. and Turlach, B. (2000). On the LASSO and its dual. 

J. Comput. Graph. Statist. 9 319-337. MR1822089 
[15] RosSET, S., Zhu, J. and Hastie, T. (2004). Boosting as a regularized path to a 

maximum margin classifier. J. Mach. Learn. Res. 5 941-973. 
[16] Shen, X., Tseng, G., Zhang, X. and Wong, W. H. (2003). On ^-learning. J. 

Amer. Statist. Assoc. 98 724-734. MR2011686 
[17] TiBSHlRANi, R. (1996). Regression shrinkage and selection via the lasso. J. Roy. 

Statist. Soc. Ser. B 58 267-288. MR1379242 
[18] TiBSHiRANi, R., Saunders, M., Rosset, S., Zhu, J. and Knight, K. (2005). Spar- 

sity and smoothness via the fused lasso. J. R. Stat. Soc. Ser. B Stat. Methodol. 

67 91-108. MR2136641 
[19] TSUDA, K. and Ratsch, G. (2005). Image reconstruction by linear programming. 

IEEE Trans. Image Process. 14 737-744. 
[20] Vapnik, V. (1995). The Nature of Statistical Learning Theory. Springer, New York. 

MR1367965 

[21] Zhu, .J., Rosset. S., Hastie, T. and Tibshirani, R. (2003). 1-norm support vector 
machines. In Advances in Neural Information Processing Systems 16. 



Predictive Modeling Group 

IBM T. J. Watson Research Center 

YORKTOWN Heights, New York 10598 

USA 

E-MAIL: srossetOus. ibm.com 



Department of Statistics 
University of Michigan 
1085 South University 
Ann Arbor, Michigan 48109 
USA 

E-MAIL: jizhu@umich.edu 



